Semantic Wrappers for Semi-Structured Data Extraction
Abstract
In this paper, we propose an approach to extract information from HTML pages and to add semantic (XML) tags to them. Wrapping is an essential technique for automatically extracting information from Web sources. The paper describes both a general rule-based approach that can be used to generate wrappers automatically and an assistant wrapper generator called WebMantic. We also provide experimental results showing that both the rule generation process and the preprocessing task are fast and reliable. © 2008 European Society of Computational Methods in Sciences and Engineering
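To make the idea of a rule-based wrapper concrete, the following is a minimal Python sketch, not WebMantic's actual rule format or implementation, which the abstract does not specify. It assumes a hypothetical rule that maps the column position of each HTML table cell to a semantic XML tag name; the names `TableRows`, `wrap`, `RULE`, and the sample HTML are all illustrative.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TableRows(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def wrap(html, rule, record_tag="record", root_tag="records"):
    """Apply a hypothetical rule {column index: XML tag} to every table row."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    root = ET.Element(root_tag)
    for row in parser.rows:
        record = ET.SubElement(root, record_tag)
        for index, tag in rule.items():
            if index < len(row):  # skip columns missing from a short row
                ET.SubElement(record, tag).text = row[index]
    return ET.tostring(root, encoding="unicode")


# Illustrative input and rule (assumptions, not data from the paper).
HTML = (
    "<table>"
    "<tr><td>Dune</td><td>Frank Herbert</td></tr>"
    "<tr><td>Emma</td><td>Jane Austen</td></tr>"
    "</table>"
)
RULE = {0: "title", 1: "author"}

xml_out = wrap(HTML, RULE)
print(xml_out)
```

The rule is kept separate from the parsing code so that a generator tool could, in principle, produce rules without touching the wrapper itself; how the paper's approach actually derives its rules is not described in the abstract.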
Similar resources
Automatically Regenerating Wrappers for Web Sources Using Results from Previous Queries
A substantial subset of the web data follows some kind of underlying structure. Nevertheless, HTML does not contain any schema or semantic information about the data it represents. A program able to provide software applications with a structured view of those semi-structured web sources is usually called a wrapper. Wrappers are able to accept a query against the source and return a set of stru...
SG-WRAP: A Schema-Guided Wrapper Generator
With the development of the Internet, the World-Wide Web has become everyone's invaluable information source. However, most of the data on the Web is currently in the form of HTML pages, which are neither well-structured nor associated with a schema. It is almost impossible to use such data efficiently. Web wrapper technology has been developed to transform unstructured/semi-structured data to semi-st...
Web-Scale Extension of RDF Knowledge Bases from Templated Websites
Only a small fraction of the information on the Web is represented as Linked Data. This lack of coverage is partly due to the paradigms followed so far to extract Linked Data. While converting structured data to RDF is well supported by tools, most approaches to extract RDF from semi-structured data rely on extraction methods based on ad-hoc solutions. In this paper, we present a holistic and o...
Xtractor: A light wrapper for XML paragraph-centric documents
The emergence of XML has led to the development of applications centered on XML documents. Often these documents contain tagged paragraphs of natural language text. Extracting relevant data from such paragraphs is complicated by the irregular structure hidden in the text and requires powerful extraction patterns. Although a large spectrum of wrappers has been conceived to mainly process HTML pages, the ...